Sample tk data #73

Merged: lizgzil merged 11 commits into dev from sample_tk_data on Nov 5, 2021
Conversation

lizgzil (Contributor) commented Oct 26, 2021


Fixing #68

  • find and save a sample of the Textkernel data (get_tk_sample.py)
  • update the predict sentence class script so it can use data from a pre-found sample (rather than sampling within the script); it still works with the old method too
  • update the READMEs

This samples 5 million job adverts at random across files. So when the skill sentence predictions are output there are more of them in total, but each output file contains less data. Previously, 100 random files were selected and the first 10k job adverts were processed; now there is data from 647 of the files, with a random selection from each (which is fewer than 10k).
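For context, this is roughly what sampling a fixed number of adverts across many files can look like. This is a minimal sketch with hypothetical names (file_job_counts would be built in a first pass over the files), not the actual code in get_tk_sample.py:

import random

random.seed(42)  # illustrative seed, not necessarily the one used in the PR

# file_job_counts: {file_name: number of job adverts in that file}
all_ids = [
    (file_name, advert_index)
    for file_name, n_adverts in file_job_counts.items()
    for advert_index in range(n_adverts)
]
sample = random.sample(all_ids, 5_000_000)

# Group the sampled indices by file, so each file can then be processed independently
sample_by_file = {}
for file_name, advert_index in sample:
    sample_by_file.setdefault(file_name, []).append(advert_index)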

Timing

Each file goes through the same algorithm independently and takes roughly 15-20 minutes. These are the timings for each step of the algorithm for 1 data file out of the 647:

2021-10-26 17:34:02,278 - __main__ - INFO - Loading data from inputs/data/textkernel-files/historical/2020/2020-03-11/jobs_2.110.jsonl.gz ...
2021-10-26 17:34:18,424 - __main__ - INFO - Splitting sentences ...
2021-10-26 17:39:30,769 - __main__ - INFO - Splitting sentences took 312.3075065612793 seconds
2021-10-26 17:39:30,795 - __main__ - INFO - Processing sentences took 312.3337023258209 seconds
2021-10-26 17:39:30,795 - __main__ - INFO - Transforming skill sentences ...
Getting embeddings for 168906 texts ...
.. with multiprocessing
2021-10-26 17:39:30,795 - sentence_transformers.SentenceTransformer - INFO - Start multi-process pool on devices: cuda:0
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/ubuntu/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package punkt to /home/ubuntu/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
2021-10-26 17:39:33,602 - sentence_transformers.SentenceTransformer - INFO - Chunk data into packages of size 5000
Took 643.1803483963013 seconds
2021-10-26 17:53:31,097 - __main__ - INFO - Chunking up sentences ...
2021-10-26 17:53:31,248 - __main__ - INFO - Chunking up sentences into 169 chunks took 0.1508171558380127 seconds
2021-10-26 17:53:31,248 - __main__ - INFO - Predicting skill sentences ...
2021-10-26 17:54:07,619 - __main__ - INFO - Predicting on 168906 sentences took 36.37088871002197 seconds
2021-10-26 17:54:07,619 - __main__ - INFO - Combining data for output ...
2021-10-26 17:54:07,675 - __main__ - INFO - Combining output took 0.05558419227600098 seconds
2021-10-26 17:54:07,675 - __main__ - INFO - Saving data to outputs/sentence_classifier/data/skill_sentences/2021.10.26/textkernel-files/historical/2020/2020-03-11/jobs_2.110_2021.08.16.json ...
2021-10-26 17:54:08,017 - skills_taxonomy_v2.getters.s3_data - INFO - Saved to s3://skills-taxonomy-v2 + outputs/sentence_classifier/data/skill_sentences/2021.10.26/textkernel-files/historical/2020/2020-03-11/jobs_2.110_2021.08.16.json ...

9871 job adverts were in the "historical/2020/2020-03-11/jobs_2.110.jsonl.gz" file.

We would expect an average of 5,000,000/647 ≈ 7,728 sampled job adverts per file, so this file seems to have a particularly large share of the sample in it.

There will be different numbers of sentences in each job advert, but scaling that up means 20 mins × 647 files ≈ 9 days, or 20 mins × (5,000,000/9,871) ≈ 7 days.
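As a back-of-envelope check of those numbers (assuming ~20 minutes per file):

n_files = 647
mins_per_file = 20
sample_size = 5_000_000
adverts_in_this_file = 9_871

print(sample_size / n_files)  # ~7728 sampled adverts expected per file
print(mins_per_file * n_files / (60 * 24))  # ~9 days if every file takes 20 minutes
print(mins_per_file * (sample_size / adverts_in_this_file) / (60 * 24))  # ~7 days scaling by this file's advert count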

Target areas for speeding up!

  1. Around half of the processing time is spent transforming the sentences using the BERT model.

The biggest sticking point is transforming the sentences using the pre-trained BERT model (even when using multiprocessing), i.e. in this function, which is called in this PR here. So could this step be done better in order to speed up that area of the pipeline?

  2. Around a quarter of the processing time is spent splitting the sentences.

The second biggest time lag is splitting the text up into sentences. This is done via the split_sentence function, which in this PR is called here:

with Pool(4) as pool:  # 4 cpus
    partial_split_sentence = partial(split_sentence, nlp=nlp, min_length=30)
    split_sentence_pool_output = pool.map(partial_split_sentence, data)
logger.info(f"Splitting sentences took {time.time() - start_time} seconds")
lizgzil (Contributor, Author):
this takes 5 minutes

lizgzil (Contributor, Author):
now down to <5 seconds thanks to comments by @jaklinger


if sentences:
    logger.info(f"Transforming skill sentences ...")
    sentences_vec = sent_classifier.transform(sentences)
lizgzil (Contributor, Author):

this takes about 10 minutes
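The "Start multi-process pool" and "Chunk data into packages of size 5000" log lines above come from the sentence-transformers multi-process encoding API, which looks roughly like the sketch below (the model name and parameter values here are illustrative, not necessarily what this repo uses); batch_size and chunk_size are the main knobs to experiment with:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model name
model.max_seq_length = 512

# Spin up one worker process per available GPU (or several CPU workers)
pool = model.start_multi_process_pool()
# Sentences are sent to the workers in chunks of `chunk_size` and encoded in batches of `batch_size`
embeddings = model.encode_multi_process(sentences, pool, batch_size=32, chunk_size=5000)
model.stop_multi_process_pool(pool)

On a single GPU, a plain model.encode(sentences, batch_size=...) call with a larger batch size may be just as fast, since the multi-process pool mainly pays off with several devices.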

jaklinger commented Oct 27, 2021

First note (another coming I think, but have a meeting now!): nltk's sent_tokenize is 10-100x faster than nlp(...). This should bring us from 7-9 days to 4-6 days 😄
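A minimal illustration of the swap being suggested (variable names are illustrative, not the repo's actual code):

from nltk.tokenize import sent_tokenize

# Before (spaCy): the full nlp pipeline runs on every document
# sentences = [sent.text for sent in nlp(text).sents]

# After (nltk): the lightweight pre-trained punkt sentence splitter
sentences = [s for s in sent_tokenize(text) if len(s) >= 30]  # keeping a min_length-style filter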

@jaklinger

Second note, also applying to the sentence processing: quite often there is an overhead in creating threads, so rather than doing 10,000 operations over 4 cores in 2,500 tasks, you can do 4 × 2,500 operations over 4 cores in 4 tasks. In general, a more practical way to do this is to split the data into chunks and then flatten the output. Potentially you will make a saving of another factor of 10 on the sentence splitting here:

def split_sentence_over_chunk(chunk, nlp, min_length):
    partial_split_sentence = partial(split_sentence, nlp=nlp, min_length=min_length)
    return list(map(partial_split_sentence, chunk))

def make_chunks(lst, n):
    for i in range(0, len(lst), n):
        yield lst[i:i + n]

...

with Pool(4) as pool:  # 4 cpus
    chunks = make_chunks(data, 1000)  # chunks of 1000s sentences
    partial_split_sentence = partial(split_sentence_over_chunk, nlp=nlp, min_length=30)
    # NB the output will be a list of lists, so make sure to flatten after this!
    split_sentence_pool_output = pool.map(partial_split_sentence, chunks)
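And the flattening mentioned in the comment could then be, for example:

from itertools import chain

# pool.map returned one list per chunk, so flatten back into a single list
split_sentence_output = list(chain.from_iterable(split_sentence_pool_output))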

jaklinger commented Oct 27, 2021

General comment: you could get a speed-up of around 100× by switching the pipeline to Metaflow + batch with max-workers=100, whilst splitting the embeddings up into chunks. Something like the following (note: just pseudo-code), which would fan out over files, then fan out again over sentence chunks, and in the end save some data either locally, to S3, or as an S3 artefact, over which you then do your analytic step.

I suspect that this would take just a couple of hours to run for your whole dataset, so even if it would take 5 days to write it would still be worth it, not taking into account additional development cycles of batches of 5 days 😄

from metaflow import FlowSpec, batch, step
from sentence_transformers import SentenceTransformer

# job_ad_file_names, get_sentences, make_chunks and bert_model_name would be defined elsewhere

class SentenceFlow(FlowSpec):
    @step
    def start(self):
        self.file_names = job_ad_file_names
        self.next(self.process_sentences, foreach="file_names")

    @batch()
    @step
    def process_sentences(self):
        self.file_name = self.input
        sentence_data = get_sentences(self.file_name)  # a list of dicts
        self.chunks = make_chunks(sentence_data)
        self.next(self.embedding_chunks, foreach="chunks")

    @batch()
    @step
    def embedding_chunks(self):
        # save on memory with while/pop
        texts, ids = [], []
        while self.input:
            row = self.input.pop(0)
            texts.append(row['text'])
            ids.append(row['ids'])
        bert_model = SentenceTransformer(bert_model_name)
        bert_model.max_seq_length = 512
        vecs = bert_model.encode(texts)
        self.data = list(zip(ids, vecs))
        self.next(self.join_embedding_chunks)

    @step
    def join_embedding_chunks(self, inputs):
        self.data = []
        for inp in inputs:
            self.data += inp.data
        self.next(self.join_files)  # a file-level join/save step would follow (not shown)

... etc ...
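For what it's worth, assuming the flow above were saved as e.g. sentence_flow.py, the fan-out would be driven from the command line with something like python sentence_flow.py run --max-workers 100, and the @batch steps additionally need Metaflow to be configured with AWS Batch.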

lizgzil commented Oct 27, 2021

> First note (another coming I think, but have a meeting now!): nltk's sent_tokenize is 10-100x faster than nlp(...). This should bring us from 7-9 days to 4-6 days 😄

whoa! ok this was much better. Went from 25 secs to 3 secs (on 100 job adverts)

lizgzil marked this pull request as ready for review October 28, 2021 14:53

lizgzil commented Nov 4, 2021

After making some changes, the code actually just took 4.5 days to run


lizgzil commented Nov 5, 2021

There are some files included in the sample which don't contain the full-text metadata. This leaves us with 4,312,285 job adverts, with the following distribution over time (in comparison to all the data files, minus the ones without the full-text metadata):
[Figure: tk_sample_dates_no_expired — sampled job advert dates compared to all data]

lizgzil closed this Nov 5, 2021
lizgzil reopened this Nov 5, 2021
lizgzil merged commit 078e135 into dev Nov 5, 2021
lizgzil deleted the sample_tk_data branch November 5, 2021 16:57